ABSTRACT

Some decision rules for discriminating among alternative regression
models are proposed and mutually compared. They are essentially based on
the Akaike Information Criterion as well as the Kullback-Leibler
Information Criterion (KLIC): namely, the distance between a postulated
model and the true unknown structure is measured by the KLIC. The proposed
criteria combine the parsimony of parameters with the goodness of fit.
Their relationships with conventional criteria are discussed in terms of
a new concept of unbiasedness.
1. Introduction

In most statistical analyses it is taken for granted that the family of
the probability distribution functions, say $F(y \mid \theta)$, may be
correctly specified on a priori grounds. Uncertainty exists, therefore,
only with reference to the values of the parameters $\theta$ involved in
the specified family of probability distribution functions (p.d.f.). In
practice, however, we are seldom in such an ideal situation; that is, we
are more or less uncertain about the family to which the true p.d.f. might
belong. It may be very likely that the true distribution is in fact too
complicated to be represented by a simple mathematical function such as is
given in ordinary textbooks.

In practice we approximate the true distribution by one of the
alternative p.d.f.'s listed in textbooks. Needless to say, we try to choose
the most adequate p.d.f. with due thought to a priori considerations. A
p.d.f. specified by a convenient mathematical function is usually termed
a model. For further analysis a postulated model is identified at least
tentatively with the true distribution. To put it differently, in the
process of conventional statistical analysis a sharp distinction is seldom
drawn between the postulated model and the true distribution.

To  avoid  the  arbitrariness  that  inevitably  occurs  in  the  process  of 
model  building,  nonparametric  statistical  methods  have  been  extensively 
developed  in  the  past  two  decades.   It  seems  to  me,  however,  that  these 
methods  have  not  been  used  very  successfully  in  practical  data  analysis. 
In  fact,  most  statistical  inferences  are  based  on  some  specific  parametric 
models,  very  often  on  the  model  of  normal  distribution. 
In recent years, however, more and more emphasis has been placed on the
problem of model identification;1/ that is, how to identify the model when
it cannot be completely specified from a priori knowledge. The main
purpose of the present paper is to propose and analyze statistical
criteria for model identification in regression analysis. Our basic
attitude toward the problem is to recognize the fact that a certain
amount of discrepancy inevitably exists between the true distribution
and the model. The best we can do in trying to cope with this sort of
situation is to identify the relatively most adequate model among a given
set of alternatives. The adequacy of a model needs to be quantified by
defining a suitable measure of the distance of the model from the unknown
true distribution.

It is expected intuitively that the more complicated model will provide
the better approximation to reality. But, on the contrary, in most
practical situations the less complicated model is likely to be preferred
if we wish to pursue the accuracy of estimation. To illustrate this point,
let us consider the situation where two alternative density functions,
$f_1(\cdot \mid \theta)$ and $f_2(\cdot \mid \zeta)$, are given as possible
models of the density $g(\cdot)$ of the unknown true distribution, where
$\theta$ and $\zeta$ are finite-dimensional vectors of unknown parameters.
Even if $f_1(\cdot \mid \theta)$ is the better approximation to the true
density $g(\cdot)$ in the sense that

$\inf_{\theta} \| f_1(\cdot \mid \theta) - g(\cdot) \| < \inf_{\zeta} \| f_2(\cdot \mid \zeta) - g(\cdot) \|$,

where $\| \cdot \|$ is a suitably defined distance measuring the difference
between two p.d.f.'s, it is quite likely that

$E_{\hat\theta} \| f_1(\cdot \mid \hat\theta) - g(\cdot) \| > E_{\hat\zeta} \| f_2(\cdot \mid \hat\zeta) - g(\cdot) \|$  if  $\dim \theta > \dim \zeta$,

where $\hat\theta$ and $\hat\zeta$ are some reasonable estimates for
$\theta$ and $\zeta$, respectively.
The above consideration leads us naturally to the so-called principle of
parsimony. That is, more parsimonious use of parameters should be pursued
so as to raise the accuracy of estimates for unknown parameters in a
model. In general, closeness to the true distribution is incompatible
with parsimony of parameters. These two criteria form a trade-off: if
one pursues one of the criteria, the other must necessarily be sacrificed.
The multiple correlation coefficient adjusted for the degrees of freedom
may be the most commonly used statistic that incorporates the two
incompatible criteria into a single statistic.

Akaike [1] has proposed a more general as well as more widely applicable
statistic that ingeniously incorporates the above two criteria. As it is
based on the Kullback-Leibler Information Criterion, Akaike's statistic is
called the Akaike Information Criterion and is abbreviated as the AIC.
Indeed, the procedure developed here is also based on the Kullback-Leibler
Information Criterion, but the criterion for the choice of the most
adequate regression model implied by our procedure is considerably
different from that implied by the AIC. The disagreement stems from, among
other things, a difference between Akaike's and our views on the true
distribution.

Some  readers  may  feel  that  it  is  useless  to  study  the  preliminary 
test  any  more  because  the  resultant  estimator  has  been  proved  to  be 
inadmissible.   To  avoid  this  criticism  in  advance,  we  point  out  that  what 
we  are  proposing  is  not  an  estimation  procedure  but  a  procedure  for  model 
identification.  More  precisely,  in  the  present  context  we  aim  to  develop 
a  procedure  for  identifying  the  most  adequate  model  from  a  given  set  of 
alternatives  rather  than  estimating  unknown  parameters  involved  in  a 
given  true  model. 
In Section 2 we briefly review the Kullback-Leibler Information
Criterion and the Akaike Information Criterion. In Section 3 we develop
a criterion for the choice of the most adequate regression model and
compare it with a criterion implied by the Akaike Criterion. In Section
4 a different criterion is derived on the basis of the minimum attainable
Bayes risk. The biases of those criteria are discussed in Section 5.

2.   Information  Criterion 

Suppose that we are concerned with the probabilistic structure of a
vector random variable $Y' = (Y_1, Y_2, \ldots, Y_n)$. Let $G(y)$ be the
true joint distribution of $Y$. On the basis of a priori knowledge we
postulate a model $F(y \mid \theta)$ to approximate the unknown true
distribution $G(y)$, where $\theta$ is a finite-dimensional vector of
unknown parameters.

The adequacy of a postulated model may be appropriately measured
by the Kullback-Leibler Information Criterion (KLIC),

(2.1)  $I(G : F(\cdot \mid \theta)) = E_G\left[\log \frac{g(Y)}{f(Y \mid \theta)}\right] = \int \log \frac{g(y)}{f(y \mid \theta)} \, dG(y)$,

where $g$ and $f$ are density (or probability) functions of, respectively,
$G$ and $F$; $E_G(\cdot)$ stands for expectation with respect to the true
distribution $G$; the integration is over the entire range of $Y$. It can
be easily shown that the KLIC is nonnegative,

(2.2)  $I(G : F(\cdot \mid \theta)) \ge 0$,

with equality only when $F(y \mid \theta) = G(y)$ almost everywhere in the
possible range of $Y$; namely, only when the model is essentially correct.
(See, for instance, Rao [7], pp. 58-59.) Incidentally, the negative value
of the KLIC is termed the entropy of a probability distribution $G(y)$
with respect to $F(y \mid \theta)$. Noting the inequality (2.2) as well as
an obvious equality

(2.3)  $I(G : F(\cdot \mid \theta)) = \int \log g(y) \, dG(y) - \int \log f(y \mid \theta) \, dG(y)$,

we are led to propose the following rule for a comparison of alternative
models or estimates.2/

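Rule 2.1: (i) A model $F_1(\cdot \mid \theta)$ is regarded as the better
approximation to the true distribution $G(\cdot)$, i.e., a more adequate
model than an alternative model $F_2(\cdot \mid \zeta)$, if and only if

(2.4)  $\inf_{\theta} I(G : F_1(\cdot \mid \theta)) < \inf_{\zeta} I(G : F_2(\cdot \mid \zeta))$,

or equivalently

(2.5)  $\sup_{\theta} E_G[\log f_1(Y \mid \theta)] > \sup_{\zeta} E_G[\log f_2(Y \mid \zeta)]$.

(ii) Given a model $F(\cdot \mid \theta)$, an estimate $\hat\theta_1$ is
regarded as a better estimate than $\hat\theta_2$ if and only if

(2.6)  $E_{\hat\theta_1}\{E_G[\log f(Y \mid \hat\theta_1) \mid \hat\theta_1]\} > E_{\hat\theta_2}\{E_G[\log f(Y \mid \hat\theta_2) \mid \hat\theta_2]\}$,

where $E_{\hat\theta_1}$ and $E_{\hat\theta_2}$ stand for expectations with
respect to the sampling distributions of $\hat\theta_1$ and $\hat\theta_2$,
respectively. (Note that when we first take an expectation with respect to
$G$ the estimate $\hat\theta_1$ or $\hat\theta_2$ should be treated as if
it were a constant.)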

In  words,  the  adequacy  of  a  postulated  model  is  measured  by 
the  minimum  possible  KLIC  distance  between  the  model  and  the  true 
distribution. 
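
As a small numerical illustration of Rule 2.1 (i), the sketch below uses the closed-form KLIC between two univariate normal distributions. The true distribution N(1, 4), the two candidate families, and the function name `klic_normal` are assumptions chosen for illustration only; they do not appear in the paper.

```python
import math

def klic_normal(mu_g, var_g, mu_f, var_f):
    """KLIC I(G:F) for G = N(mu_g, var_g) and F = N(mu_f, var_f) (closed form)."""
    return 0.5 * (math.log(var_f / var_g)
                  + (var_g + (mu_g - mu_f) ** 2) / var_f - 1.0)

# Hypothetical true distribution G = N(1, 4).
MU_G, VAR_G = 1.0, 4.0

# Family F1: N(theta, 1), free mean and unit variance.
# Family F2: N(0, zeta), zero mean and free variance.
grid = [i / 1000.0 for i in range(1, 20001)]          # 0.001, ..., 20.0

# inf over theta in (-10, 10] and inf over zeta in (0, 20], by grid search.
best_f1 = min(klic_normal(MU_G, VAR_G, g - 10.0, 1.0) for g in grid)
best_f2 = min(klic_normal(MU_G, VAR_G, 0.0, z) for z in grid)

# Rule 2.1 (i): the family with the smaller minimum KLIC is the more adequate.
more_adequate = "F1" if best_f1 < best_f2 else "F2"
```

In this toy setting neither family contains the true distribution; the rule prefers the family whose closest member (its pseudo-true model, in the terminology introduced below) lies nearer to G.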
It was pointed out by Akaike [1] that if the $Y_i$'s are independent
and identically distributed the maximum likelihood estimate may be
regarded as an estimate that minimizes the estimated KLIC, or equivalently
maximizes the estimated entropy, because the log likelihood function
divided by the sample size $n$,

(2.7)  $\frac{1}{n} \sum_{j=1}^{n} \log f(y_j \mid \theta)$,

may be regarded as a reasonable estimate for $E_G[\log f(Y \mid \theta)]$
whatever $G(y)$ is.

Apparently, the above rule for a comparison of models is not directly
applicable in practice, because the criteria are totally dependent on the
unknown true probability distribution. To establish a practically usable
criterion for model identification on the basis of the KLIC, we need to
replace the unknowns in (2.5) by reasonable estimates. In fact, the
Akaike Information Criterion (AIC) has been derived as an approximately
unbiased estimate for the KLIC, neglecting its irrelevant constant terms
and based implicitly on a fairly strong assumption that will be stated
later.

For the sake of convenience in developing our argument we give the
following definition:

Definition: Given a model $F(\cdot \mid \theta)$, a parameter value
$\theta_0$ such that

(2.8)  $I(G : F(\cdot \mid \theta_0)) \le I(G : F(\cdot \mid \theta))$

for any possible $\theta$ in the admissible parameter space is called a
pseudo-true parameter value; $F(\cdot \mid \theta_0)$ is called a
pseudo-true model.
If the true distribution $G(y)$ and a model $F(y \mid \theta)$ satisfy due
regularity conditions, the pseudo-true parameter $\theta_0$ must satisfy

(2.9)  $E_G\left[\frac{\partial}{\partial \theta} \log f(Y \mid \theta)\right]_{\theta = \theta_0} = 0$.

The model $F(y \mid \theta_0)$ may be regarded as the relatively most
adequate within the family of models $F(y \mid \theta)$ in the sense that
the KLIC for $F(y \mid \theta)$ is minimized by $F(y \mid \theta_0)$. We
note that Rule 2.1 is based on the comparison of the KLIC distances between
the pseudo-true models and the true model. Assuming that
$I(G : F(\cdot \mid \theta_0)) = O(n^{-1})$, i.e., the pseudo-true model is
nearly true, Akaike [1] derives his criterion

(2.10)  $\mathrm{AIC}(F(\cdot \mid \hat\theta)) = -2 \log f(y \mid \hat\theta) + 2k$

as an almost unbiased estimate for $-2 E_G[\log f(Y \mid \theta_0)]$, where $\hat\theta$ is the
maximum likelihood estimate for $\theta$ based on observations $y$ and $k$ is the
number of the unknown parameters, i.e., the dimension of $\theta$. The procedure
of choosing a model that minimizes the AIC is called the Minimum AIC (MAIC)
procedure. The first term of the AIC measures the goodness-of-fit of the
model to a given set of data, because $f(y \mid \hat\theta)$ is the
maximized likelihood function. The second term is interpreted as
representing a penalty that should be paid for increasing the number of
parameters. In this sense the AIC may be regarded as an explicit
formulation of the so-called principle of parsimony in model building.
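
The two terms of (2.10) can be made concrete with a minimal sketch for a Gaussian model, where the maximized log likelihood depends on the data only through the ML variance estimate. The helper names `gaussian_max_loglik` and `aic` are hypothetical, and `k` is taken to be the dimension of the full parameter vector, as in (2.10).

```python
import math

def gaussian_max_loglik(residuals):
    """Maximized Gaussian log likelihood given residuals from the fitted mean.

    Uses the ML variance estimate (divisor n), so that
    log f(y | theta_hat) = -(n/2) * (log(2*pi) + log(sigma2_hat) + 1).
    """
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2_hat) + 1.0)

def aic(max_loglik, k):
    """AIC as in (2.10): goodness-of-fit term plus the penalty 2k."""
    return -2.0 * max_loglik + 2.0 * k

# Illustrative data (made up): compare a zero-mean model (k = 1, variance only)
# with a free-mean model (k = 2, mean and variance).
y = [0.3, -0.1, 0.4, 0.2, -0.2, 0.5]
ybar = sum(y) / len(y)
aic_zero_mean = aic(gaussian_max_loglik(y), 1)
aic_free_mean = aic(gaussian_max_loglik([v - ybar for v in y]), 2)
```

The MAIC procedure would pick whichever of the two values is smaller; the free-mean model always fits at least as well, so the comparison turns on whether the gain in fit exceeds the extra penalty of 2.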
Indeed, the assumption that

(2.11)  $I(G : F(\cdot \mid \theta_0)) = O(n^{-1})$

for every model $F$ simplifies the derivation substantially, but there is
no denying that this simplifying assumption lessens the plausibility of
the AIC to some extent. To see this point in more detail let us consider
the case where we have to choose one from two alternatives, say $F_1$
and $F_2$. The AIC for $F_1$ is evaluated assuming that $F_1$ with its
pseudo-true parameter value is true, while the AIC for $F_2$ is evaluated
assuming that $F_2$ with its pseudo-true parameter value is true.
Thereafter, the two AIC's are numerically compared. In the next section,
confining ourselves to linear regression, we derive another criterion
called the BIC on the basis of weaker assumptions than (2.11) and compare
it with the AIC to see how much difference might arise depending on
whether or not we assume (2.11).

3.   Identification  of  a  Regression  Model 

We are interested in investigating a joint distribution of a vector
random variable $Y' = (Y_1, Y_2, \ldots, Y_n)$. Each of the $Y_i$'s may be
an observation on a certain characteristic of a randomly chosen
individual; or the $Y_i$'s may constitute a sequence of observed time
series. The distribution function $G(y)$ is unknown, but each $Y_i$ is
assumed to possess finite variance. We denote the mean vector and the
variance-covariance matrix, respectively, by $\mu$ and $\Omega$, where
$\mu$ is a vector of $n$ components and $\Omega$ is an $n \times n$
positive definite matrix. Unless we place more a priori restrictions on
the elements of $\mu$ and $\Omega$, we can make no inference at all about
the joint distribution of $Y$.

What we usually do is to assume that $\mu$ belongs to a linear subspace
of lower dimension than $n$ and that the $Y_i$'s are mutually uncorrelated.
Then we have a familiar linear regression model

(3.1)  $E(Y) = X\beta, \quad V(Y) = \sigma^2 I_n$,

where $X$ is an $n \times k$ matrix of known constants, the $k$ columns of
which constitute a basis of the subspace to which $\mu$ is assumed to
belong; $\beta$ is a vector of $k$ unknown parameters; $\sigma^2$ is an
unknown positive constant; $I_n$ is an identity matrix of order $n$. In
most practical situations the columns of $X$ are vectors of observations
on certain characteristics considered to be associated with $Y$. Then the
model implies that the $i$-th mean $\mu_i$ is represented as a linear
function of $k$ explanatory variables, i.e.,
$\mu_i = \sum_{j=1}^{k} \beta_j x_{ij}$, where $x_{ij}$ is the $(i,j)$-th
element of $X$. By assuming a regression model we can reduce the number of
unknown parameters from $n + n(n+1)/2$ to $k + 1$.

In addition to (3.1) we often assume the normal distribution for $Y$
and postulate a model

(3.2)  $Y \sim N(X\beta, \sigma^2 I_n)$,

or

       $Y = X\beta + u, \quad u \sim N(0, \sigma^2 I_n)$,

which is termed a linear normal regression model.

Lemma 3.1: The pseudo-true values for the parameters
$\theta' = (\beta', \sigma^2)$ are

(3.3)  $\beta_0 = (X'X)^{-1} X' \mu$,

(3.4)  $\sigma_0^2 = \frac{1}{n} \mu' (I - X(X'X)^{-1}X') \mu + \frac{1}{n} \operatorname{tr} \Omega$.

The above results are easily obtained by solving the equations

(3.5)  $E\left[\frac{\partial}{\partial \beta} \log f(Y \mid \theta)\right] = 0$;

(3.6)  $E\left[\frac{\partial}{\partial \sigma^2} \log f(Y \mid \theta)\right] = 0$,

where $f(y \mid \theta)$ is the density function of $N(X\beta, \sigma^2 I)$
and the expectation is with respect to the true distribution. (All the
lemmas and theorems are proved in the Appendix.) Geometrically speaking,
$X\beta_0$ is a projection of the unknown mean vector $\mu$ onto the space
spanned by the $k$ columns of $X$, while $n\sigma_0^2$ is the sum of the
variances of the $Y_i$'s plus the squared length of the perpendicular from
$\mu$ to the space. Speaking heuristically, the error of approximating
$\mu$ by $X\beta$ is absorbed into the error variance.
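
The geometric reading of Lemma 3.1 can be checked on a toy case. The sketch below assumes the simplest nontrivial design (an intercept and one regressor) and $\Omega = \omega^2 I$; the function name `pseudo_true_simple` and the numerical inputs are illustrative assumptions, not part of the paper.

```python
def pseudo_true_simple(x, mu, omega2):
    """Pseudo-true (b0, b1, sigma0^2) for the model E(Y_i) = b0 + b1 * x_i.

    mu is the true mean vector and the true covariance is assumed to be
    omega2 * I, so that tr(Omega) / n = omega2 in the lemma.
    """
    n = len(x)
    xbar = sum(x) / n
    mbar = sum(mu) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxm = sum((xi - xbar) * (mi - mbar) for xi, mi in zip(x, mu))
    b1 = sxm / sxx                      # projection of mu on the regressor
    b0 = mbar - b1 * xbar
    # Squared length of the perpendicular from mu to the column space of X.
    perp2 = sum((mi - b0 - b1 * xi) ** 2 for xi, mi in zip(x, mu))
    sigma0_sq = perp2 / n + omega2
    return b0, b1, sigma0_sq

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Misspecified case: the true mean is quadratic in x, the model is linear.
b0, b1, s0 = pseudo_true_simple(x, [xi ** 2 for xi in x], omega2=1.0)
# Correctly specified case: the true mean is linear in x.
c0, c1, t0 = pseudo_true_simple(x, [1.0 + 2.0 * xi for xi in x], omega2=1.0)
```

In the misspecified case the pseudo-true error variance exceeds $\omega^2$ by the mean squared approximation error; in the correctly specified case it equals $\omega^2$ and the pseudo-true coefficients coincide with the true ones.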
The maximum likelihood (ML) estimates

(3.7)  $\hat\beta = (X'X)^{-1} X' y, \quad \hat\sigma^2 = \frac{1}{n} y' [I - X(X'X)^{-1}X'] y$

for $\beta$ and $\sigma^2$ in the normal regression model (3.2) have the
following property.

Lemma 3.2:

(3.8)  $E(\hat\beta) = \beta_0$,

(3.9)  $\lim_{n \to \infty} E(\hat\sigma^2 - \sigma_0^2) = 0$, if $\Omega = \omega^2 I_n$ and $\lim \sigma_0^2 < \infty$.

This lemma implies that with an incorrect model our objective is the
estimation of the pseudo-true parameter values. To put it differently,
what we ordinarily call the true parameter values are the pseudo-true
parameter values that minimize the distance between the true unknown
distribution and the postulated parametric model, where the distance is
measured by the KLIC. Moreover, it should be noted that if the $Y_i$'s are
uncorrelated, i.e., $\Omega = \omega^2 I_n$, then $\hat\beta$ and
$\hat\sigma^2$ are uncorrelated.
Along the lines of the previous section, one can measure the loss
incurred by modelling $G(y)$ by $F(y \mid \hat\theta)$ with some estimate
$\hat\theta$ in place of the unknown $\theta_0$ by the quantity

(3.10)  $W(F(\cdot \mid \hat\theta)) = -\frac{2}{n} E_G[\log f(Y \mid \hat\theta) \mid \hat\theta]$,

where $f(y \mid \theta_0)$ is the density function of the pseudo-true model
$N(X\beta_0, \sigma_0^2 I)$, i.e., the likelihood function of the model. It
should be noted that the expectation on the right-hand side of (3.10)
refers only to the argument $Y$ of the density function; i.e.,
$\hat\theta$ is taken as a fixed constant.

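Lemma 3.3: The loss incurred by modelling the distribution of $Y$ by
$F(y \mid \hat\theta)$ with an estimated value $\hat\theta$ substituted for
$\theta$ is evaluated as

(3.11)  $W(F(\cdot \mid \hat\theta)) = \log(2\pi) + \log(\hat\sigma^2) + \frac{\sigma_0^2}{\hat\sigma^2} + \frac{1}{n\hat\sigma^2} \| X(\hat\beta - \beta_0) \|^2$,

where $\| \cdot \|$ is the Euclidean norm.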
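
For readers who wish to see where (3.11) comes from, the following is a short sketch of the calculation, offered as an illustration rather than as the Appendix proof. It assumes the loss is $-(2/n)$ times the expected log density under $G$ with the estimate held fixed, and it uses only the definitions of $\beta_0$ and $\sigma_0^2$ in Lemma 3.1.

```latex
\begin{align*}
\log f(Y \mid \hat\theta)
  &= -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log\hat\sigma^2
     - \tfrac{1}{2\hat\sigma^2}\,\|Y - X\hat\beta\|^2, \\
E_G\,\|Y - X\hat\beta\|^2
  &= \operatorname{tr}\Omega + \|\mu - X\hat\beta\|^2
     && \text{($\hat\beta$ held fixed)} \\
  &= \operatorname{tr}\Omega + \|\mu - X\beta_0\|^2 + \|X(\hat\beta - \beta_0)\|^2
     && \text{(since $\mu - X\beta_0 \perp$ columns of $X$)} \\
  &= n\sigma_0^2 + \|X(\hat\beta - \beta_0)\|^2
     && \text{(by (3.4))}, \\
-\tfrac{2}{n}\,E_G\bigl[\log f(Y \mid \hat\theta) \mid \hat\theta\bigr]
  &= \log(2\pi) + \log\hat\sigma^2 + \frac{\sigma_0^2}{\hat\sigma^2}
     + \frac{1}{n\hat\sigma^2}\,\|X(\hat\beta - \beta_0)\|^2 .
\end{align*}
```

The last two terms separate the loss due to misjudging the scale from the loss due to estimating the projection $X\beta_0$ with error.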

In this section we adhere to the sampling theory approach, and
hence we base our decision about model selection on the risk function
derived by integrating the loss function with respect to the sampling
distribution of the estimate $\hat\theta$. Since the ML estimate
$\hat\theta$ possesses the nice property in Lemma 3.2, even when a
postulated model is incorrect, we define the risk of postulating a model
$F(y \mid \theta)$ by an integral of the loss function of
$F(y \mid \hat\theta)$ with respect to the sampling distribution of the
ML estimate $\hat\theta$.

Theorem 3.1: Suppose that $\Omega = \omega^2 I$ and each $Y_i$ is
symmetrically distributed with the same kurtosis as a normal
distribution.3/ Then the risk of a model $F(\cdot \mid \theta)$, i.e., the
expected value of $W(F(\cdot \mid \hat\theta))$, is evaluated to order
$O(n^{-1})$ as

(3.12)  $R(F(\cdot \mid \theta)) = \log(2\pi) + \log(\sigma_0^2) + 1 + \frac{k+2}{n} \left(\frac{\omega^2}{\sigma_0^2}\right) - \frac{1}{n} \left(\frac{\omega^2}{\sigma_0^2}\right)^2 + O(n^{-2})$.

The proof is given in the Appendix. It should be noted that $\sigma_0^2$
decreases along with the successive addition of explanatory variables,
i.e., the increase of $k$.

To develop a practical and useful criterion for model identification,
the risk function involving unknown parameters needs to be somehow
estimated from a given set of observations.

Theorem 3.2: Suppose that we have an estimate, say $\hat\omega^2$, for
$\omega^2$ such that $\hat\omega^2 = \omega^2 + O_p(n^{-1/2})$, where
$O_p(n^{-1/2})$ stands for a term of stochastic order $n^{-1/2}$ and with
finite second order moment.4/ Then

(3.13)  $\mathrm{BIC}(F(\cdot \mid \hat\theta)) = -2 \log f(y \mid \hat\theta) + 2(k+2) \left(\frac{\hat\omega^2}{\hat\sigma^2}\right) - 2 \left(\frac{\hat\omega^2}{\hat\sigma^2}\right)^2$

is an asymptotically unbiased estimate of $nR(F(\cdot \mid \theta))$.


If we equate $\hat\omega^2$ to $\hat\sigma^2$, the BIC is identical with
the AIC.5/ As was pointed out in the preceding section, the AIC is based
on the assumption that the true distribution differs from the pseudo-true
model only in the order of $n^{-1}$; hence it is justifiable to equate
$\sigma_0^2$ to $\omega^2$ in (3.12) or to equate $\hat\omega^2$ to
$\hat\sigma^2$ in (3.13).
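
The relation between the two criteria can be sketched numerically. The code below assumes the criterion of Theorem 3.2 has the form "minus twice the maximized Gaussian log likelihood, plus 2(k + 2) times the variance ratio, minus twice the squared variance ratio", and that the AIC counts k regression coefficients plus the error variance. The function names are hypothetical. Note that "BIC" here denotes the criterion of this paper, not the Schwarz criterion.

```python
import math

def neg2_max_loglik(n, sigma2_hat):
    """-2 log f(y | theta_hat) for the normal regression model."""
    return n * (math.log(2.0 * math.pi) + math.log(sigma2_hat) + 1.0)

def bic_criterion(n, k, sigma2_hat, omega2_hat):
    """Criterion of Theorem 3.2; omega2_hat estimates the true variance."""
    r = omega2_hat / sigma2_hat          # variance ratio
    return neg2_max_loglik(n, sigma2_hat) + 2.0 * (k + 2) * r - 2.0 * r * r

def aic_regression(n, k, sigma2_hat):
    """AIC with k + 1 parameters (k coefficients and the variance)."""
    return neg2_max_loglik(n, sigma2_hat) + 2.0 * (k + 1)

# With omega2_hat set equal to sigma2_hat the two criteria coincide.
same = bic_criterion(20, 3, 2.0, 2.0) - aic_regression(20, 3, 2.0)
# With omega2_hat below sigma2_hat (a poorly fitting model) the penalty shrinks.
gap = bic_criterion(20, 3, 2.0, 1.0) - aic_regression(20, 3, 2.0)
```

Setting the variance ratio to one gives a penalty of 2(k + 2) - 2 = 2(k + 1), which is the AIC penalty; a ratio below one lowers the penalty for the model with the poorer fit.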

The variance ratio $\hat\omega^2 / \hat\sigma^2$ increases with successive
addition of explanatory variables, and possibly it approaches one as long
as the degrees of freedom are sufficiently large. Its reciprocal
$\hat\sigma^2 / \hat\omega^2$ ($\ge 1$) may be interpreted as a
discounting factor for the penalty that has to be paid for increasing the
number of parameters. Therefore, the favor to parsimonious models is more
pronounced in the minimum BIC procedure. When we compare two regression
models, one with fewer explanatory variables and poorer fit, the other
with more explanatory variables and better fit, the BIC is rather more
favorable to the former model than the AIC. The following numerical
evaluations show that the difference between the two criteria is far from
negligible.

Let us develop a decision rule to choose one from two nested
alternative regression models

(3.14)  $F_1: \; Y \sim N(X_1\beta_1, \sigma_1^2 I_n)$,
        $F_2: \; Y \sim N(X_1\beta_1 + X_2\beta_2, \sigma_2^2 I_n)$,

where $X_1$ and $X_2$ are respectively $n \times p$ and $n \times q$
matrices of known constants, $\beta_1$ and $\beta_2$ are respectively
$p \times 1$ and $q \times 1$ vectors of unknown parameters, and
$\sigma_1^2$ and $\sigma_2^2$ are positive unknowns. The true distribution
is assumed to be $N(\mu, \omega^2 I_n)$. In practice, we cannot expect to
obtain an estimate for $\omega^2$ from some independent source. Therefore,
assuming that the more complicated model $F_2$ is nearly true, i.e.,
$\omega^2 - \sigma_2^2 = o(1)$, we substitute the ML estimate
$\hat\sigma_2^2$ of $\sigma_2^2$ for $\hat\omega^2$ in the BIC's for both
models. Our decision rule is described as follows: we choose $F_1$ if
$\mathrm{BIC}(F_1) < \mathrm{BIC}(F_2)$ and vice versa, where
$\hat\omega^2$ is replaced by $\hat\sigma_2^2$.6/

It is straightforward to show that the decision rule based on the BIC is
equivalent to a decision rule based on the magnitude of the F-statistic

(3.15)  $W = \frac{(n - p - q)(\hat\sigma_1^2 - \hat\sigma_2^2)}{q \hat\sigma_2^2}$,

which is customarily used to test the null hypothesis $\beta_2 = 0$. That
is, we decide to choose $F_1$ if an observed value of the F-statistic falls
below a critical point determined by the inequality
$\mathrm{BIC}(F_1) < \mathrm{BIC}(F_2)$, which is equivalent to

(3.16)  $n \log V - 2(p+2)V + 2V^2 + 2(p+q+1) > 0$,

where

(3.17)  $V = \frac{\hat\sigma_2^2}{\hat\sigma_1^2} = \left[1 + \frac{q}{n-p-q} W\right]^{-1}$,

and choose $F_2$ otherwise.7/ The critical point varies depending on $n$,
$p$, and $q$.
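
The critical point has no closed form, but it can be found numerically. The sketch below works directly from the criterion (3.13) with the true variance estimated by the ML variance of the larger model, so that terms common to both BIC values cancel; it then locates the value of the F-statistic at which the two BIC values are equal by bisection. The function names and the bisection tolerance are illustrative assumptions.

```python
import math

def bic_difference(w, n, p, q):
    """BIC(F1) - BIC(F2) as a function of the F-statistic w.

    v is the ratio of the ML variance of F2 to that of F1. The terms
    n*log(2*pi) + n + n*log(sigma2_hat_2) are common to both and dropped.
    """
    v = 1.0 / (1.0 + q * w / (n - p - q))
    bic1 = -n * math.log(v) + 2.0 * (p + 2) * v - 2.0 * v * v
    bic2 = 2.0 * (p + q + 2) - 2.0       # variance ratio equals one for F2
    return bic1 - bic2

def mbic_critical_w(n, p, q=1, tol=1e-10):
    """Value of w at which BIC(F1) = BIC(F2); F1 is chosen below it."""
    lo, hi = 0.0, 1.0
    while bic_difference(hi, n, p, q) <= 0.0:   # expand until sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bic_difference(mid, n, p, q) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For q = 1 the critical point for the t-statistic is the square root.
t_crit_16_2 = math.sqrt(mbic_critical_w(16, 2))
t_crit_16_3 = math.sqrt(mbic_critical_w(16, 3))
```

At w = 0 the difference equals -2q, so the smaller model is always chosen when the added variables contribute nothing to the fit; the difference becomes positive as w grows, which is what makes the bracketing step terminate.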

Confining ourselves to the case when $q = 1$, we tabulate the critical
points implied by the minimum BIC principle, say MBIC critical points,
in Table 3.1. As the t-statistic appeals more to our intuition than the
F-statistic, these critical values are given with reference to the
t-statistic, the ML estimate of $\beta_2$ divided by its estimated standard
error. We decide to choose $F_1$ if the observed value of the t-statistic
falls below the critical point determined by the inequality (3.16) and
vice versa.

It is straightforward to show that
$\mathrm{AIC}(F_1) \le \mathrm{AIC}(F_2)$ is equivalent to the inequality

(3.18)  $W \le \left[\exp\left(\frac{2q}{n}\right) - 1\right] \frac{n - p - q}{q}$.


To examine how much the MBIC procedure differs from the MAIC procedure,
the MAIC critical point, the right-hand side of (3.18), is also tabulated
in Table 3.2.8/ Both of these approach $\sqrt{2}$ asymptotically, albeit
very slowly. The MBIC procedure is always more parsimonious than the MAIC
procedure for a finite sample. We note a remarkable difference in their
asymptotic behavior, namely that the MAIC critical point approaches
$\sqrt{2}$ from below whereas the MBIC critical point approaches it from
above. Moreover, as the number of variables already included increases,
i.e., as $p$ becomes large, the MBIC procedure increasingly discriminates
against the inclusion of additional variables, whereas the converse is
true for the MAIC.
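
Unlike the MBIC case, the MAIC critical point is available in closed form. The sketch below assumes the AIC of a normal regression model is n times the log of the ML variance estimate plus twice the number of parameters (up to a constant common to both models), which gives the bound on the F-statistic directly; the function name is hypothetical.

```python
import math

def maic_critical_w(n, p, q=1):
    """Largest F-statistic for which AIC(F1) <= AIC(F2).

    AIC(F1) <= AIC(F2)  <=>  n * log(s1 / s2) <= 2 * q, where s1 and s2 are
    the ML variance estimates, and s1 / s2 = 1 + q * W / (n - p - q).
    """
    return (math.exp(2.0 * q / n) - 1.0) * (n - p - q) / q

# For q = 1 the critical point for the t-statistic is the square root.
t_small = math.sqrt(maic_critical_w(10, 2))
t_many_regressors = math.sqrt(maic_critical_w(16, 10))
t_large_sample = math.sqrt(maic_critical_w(1000, 2))
```

For large n the factor exp(2/n) - 1 behaves like 2/n, so the t critical point tends to the square root of 2 from below, in line with the asymptotic behavior described above.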

To see a connection between our procedure and the preliminary t-test,
for some chosen cases, we tabulate the level of significance, i.e., the
probability that the absolute value of the t-statistic exceeds the critical
point when $F_1$ is true. Roughly speaking, for moderate values of $p$, the
significance level for the MAIC procedure varies over the wide range
from 30% to 16% as the number of degrees of freedom increases; on the
other hand, for the MBIC procedure, it varies over a relatively narrow
range from 10% to 16%. Both procedures share a common property in their
more generous attitude toward inclusion of additional variables than the
traditional preliminary test with the significance level 5% or 10%. It
should be noted, however, that these two asymptotically equivalent
procedures will very often lead us to different decisions for small
samples.9/
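
The significance level attached to a given critical point can be reproduced with the Student t distribution on n - p - 1 degrees of freedom. The sketch below is a standard-library-only approximation that integrates the t density by Simpson's rule; the function names and the number of integration steps are illustrative choices, and a statistical library would normally be used instead.

```python
import math

def t_density(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2.0 * math.log1p(x * x / df))

def two_sided_level(t_crit, df, steps=2000):
    """P(|T| > t_crit) via Simpson's rule on [0, t_crit]; steps must be even."""
    h = t_crit / steps
    total = t_density(0.0, df) + t_density(t_crit, df)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * t_density(i * h, df)
    central = total * h / 3.0            # P(0 < T < t_crit)
    return 1.0 - 2.0 * central

# n = 10, p = 2, q = 1 gives 7 degrees of freedom.
level_a = two_sided_level(1.245, 7)
# n = 16, p = 10, q = 1 gives 5 degrees of freedom.
level_b = two_sided_level(0.816, 5)
```

Feeding a critical t value and its degrees of freedom into `two_sided_level` gives the implied significance level of the corresponding preliminary test.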

Based on the minimax regret principle with the squared error of
prediction as a loss function, Sawa and Hiromatsu [8] calculated the
optimal significance point for the preliminary t-test. Their minimax
regret significance points are quite insensitive to the change in the
number of degrees of freedom. That is, they remain constant at 1.37 to
two decimal places, unless the number of degrees of freedom is extremely
small, say
Table 3.1: The MBIC Critical Points and Significance Levels
for the Preliminary t-Test

    n      p = 2         p = 3         p = 4         p = 5         p = 10
   10   1.646(.144)   1.816(.119)   2.036(.097)   2.264(.086)       --
   12   1.591(.146)   1.715(.125)   1.882(.102)   2.092(.081)       --
   16   1.533(.149)   1.610(.133)   1.709(.116)   1.836(.096)   2.758(.040)
   20   1.504(.151)   1.558(.139)   1.625(.125)   1.707(.110)   2.494(.034)
   30   1.469(.153)   1.500(.146)   1.536(.137)   1.576(.128)   1.912(.071)
   50   1.445(.155)   1.462(.151)   1.480(.146)   1.449(.154)   1.625(.112)
  100   1.429(.156)   1.437(.154)   1.445(.152)   1.453(.150)   1.499(.138)
  200   1.421(.156)   1.425(.156)   1.429(.154)   1.433(.154)   1.453(.148)
  500   1.417(.158)   1.419(.156)   1.420(.156)   1.421(.156)   1.429(.154)
 1000   1.416(.158)   1.416(.158)   1.417(.156)   1.418(.156)   1.421(.156)

n is the sample size and p is the number of the explanatory variables
already included in the model. The decision rule is described as follows: if
the t-value for an optional variable exceeds the MBIC critical point, we decide
to augment the model by the optional variable, and vice versa. Note that the
MBIC critical point slowly approaches $\sqrt{2}$ as n tends to infinity for every p.


Table 3.2: The MAIC Critical Points and Significance Levels
for the Preliminary t-Test

    n      p = 2         p = 3         p = 4         p = 5         p = 10
   10   1.245(.253)   1.153(.293)   1.052(.341)    .941(.400)       --
   12   1.278(.233)   1.205(.263)   1.127(.297)   1.043(.337)       --
   16   1.316(.211)   1.264(.230)   1.210(.252)   1.154(.275)    .816(.452)
   20   1.337(.199)   1.297(.213)   1.256(.228)   1.213(.245)    .973(.356)
   30   1.364(.184)   1.339(.192)   1.313(.201)   1.286(.211)   1.144(.267)
   50   1.385(.173)   1.370(.177)   1.355(.182)   1.340(.187)   1.262(.214)
  100   1.400(.164)   1.393(.166)   1.385(.170)   1.378(.172)   1.341(.184)
  200   1.407(.160)   1.404(.162)   1.400(.164)   1.396(.164)   1.378(.170)
  500   1.411(.158)   1.410(.160)   1.409(.160)   1.407(.160)   1.400(.162)
 1000   1.413(.158)   1.412(.158)   1.411(.158)   1.411(.158)   1.407(.160)


See  the  footnote  to  Table  3.1. 
less than 10. Indeed it is difficult to see a clear-cut connection
between the two basically different approaches, but it would be worth
noting that if a loss function is specified in terms of the prediction
error, the more prodigal model is likely to be preferred.10/

We often encounter a situation where we have to choose one of two
unnested alternatives:

$Y \sim N(X_1\beta_1, \sigma_1^2 I_n)$  and  $Y \sim N(X_2\beta_2, \sigma_2^2 I_n)$,

where the true distribution of $Y$ is $N(\mu, \omega^2 I_n)$. In this kind
of situation the unknown true variance $\omega^2$ may be reasonably
estimated from a regression of $y$ on all the explanatory variables
$X_1 \cup X_2$. Another reasonable estimate of $\omega^2$ may be the
smallest value of "unbiased" estimates, instead of the maximum likelihood
estimates, of variances for all possible regressions of $y$ on a subset of
$X_1 \cup X_2$.

The difficulty in estimating $\omega^2$ does admittedly place a serious
limitation on the practical usefulness of the MBIC procedure. However,
it should be noted that the same difficulty is shared by Mallows' [5]
procedure, which is based on what he calls the $C_p$ statistic.
Incidentally, Mallows' procedure gives a decision rule essentially similar
to the AIC.11/ It is worth noting that according to Akaike's procedure
$\omega^2$ is estimated by $\hat\sigma_1^2$ when we evaluate the AIC for
the model $F_1$ and by $\hat\sigma_2^2$ when we evaluate the AIC for the
model $F_2$. This means that, given a class of nested alternative models,
the AIC for each model is evaluated assuming it is nearly true in the
sense that the difference of the error variance in the model from the true
variance $\omega^2$ tends to zero as $n$ tends to infinity. (See equation
(2.11).) On the other hand, the BIC for each model is evaluated assuming
that the most complex model within the class would be nearly true but the
rest are not necessarily so.

4.   A Decision Rule Based on Bayes Risk

In this section we look at the problem another way. Given a model
$F(\cdot \mid \theta)$ coupled with a prior distribution $P(\theta)$ we define the Bayes risk,
say $B(\tilde\theta \mid F)$, for an estimate $\tilde\theta$ as the expectation of the loss function
(3.10) or (3.11) with respect to the posterior distribution, that is,

(4.1)  $B(\tilde\theta \mid F) = \int W(F(\cdot \mid \tilde\theta)) \, dP(\theta \mid y)$

where $P(\theta \mid y)$ is the posterior distribution for $\theta$ given an observation
$y$. If there exists an estimate $\theta^*$ such that

(4.2)  $B(\theta^* \mid F) = \min_{\tilde\theta} B(\tilde\theta \mid F)$,

then it is called the Bayes estimate of $\theta$ with respect to the loss
function (3.10). Recalling that $W(F(\cdot \mid \tilde\theta))$ measured the discrepancy of a
model $F(\cdot \mid \tilde\theta)$ from the true distribution $G(\cdot)$, we take $B(\theta^* \mid F)$ as a measure
of the adequacy of a postulated model $F(\cdot \mid \theta)$ associated with a prior
distribution $P(\theta)$. That is, along the lines of previous sections, if we
compare two alternative models, say, $F_1(\cdot \mid \theta)$ with $P_1(\theta)$ and $F_2(\cdot \mid \xi)$
with $P_2(\xi)$, then we decide to choose $F_1$ or $F_2$ according to whether or
not $B(\theta^* \mid F_1) < B(\xi^* \mid F_2)$.

In what follows let us be specific to a linear normal regression
model for a vector random variable $Y$:

(4.3)  $F$:  $Y \sim N(X\beta, \sigma^2 I_n)$

where $Y$ is $n \times 1$, $X$ is $n \times k$, $\beta$ is $k \times 1$, and $\mu$ is $n \times 1$; the true
distribution of $Y$ is $N(\mu, \omega^2 I_n)$ with unknowns $\mu$ and $\omega^2$. If we assume a
diffuse prior for $\beta$ and $\sigma^2$, the minimum attainable Bayes risk is
evaluated as follows:


Lemma 4.1.  Given a model $F$ with a diffuse prior, the minimum attainable
Bayes risk is

(4.4)  $B(\beta^*, \sigma^{2*} \mid F) = -\frac{2}{n} \log f(y \mid \hat\beta, \hat\sigma^2) + \log\left(\frac{n+k}{n-k-2}\right)$,

where $\hat\beta$ and $\hat\sigma^2$ are the ML estimates for $\beta$ and $\sigma^2$, $\beta^*$ and $\sigma^{2*}$ are the
Bayes estimates, and $f$ is the density function of $N(X\beta, \sigma^2 I_n)$.
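The quantity in (4.4) is easy to evaluate once a regression has been fitted. The following is a minimal sketch, assuming the residual sum of squares is already available; the function names `min_bayes_risk` and `bayes_variance_estimate` are illustrative and do not come from the paper.

```python
import math


def bayes_variance_estimate(rss: float, n: int, k: int) -> float:
    """Bayes estimate of sigma^2 under a diffuse prior, as in (A.34).

    rss is the residual sum of squares, n the sample size and k the
    number of regression coefficients (all names are illustrative).
    """
    if n - k - 2 <= 0:
        raise ValueError("the Bayes risk is finite only if n - k - 2 > 0")
    sigma2_ml = rss / n  # ML estimate of sigma^2
    return (n + k) / (n - k - 2) * sigma2_ml


def min_bayes_risk(rss: float, n: int, k: int) -> float:
    """Minimum attainable Bayes risk (4.4)."""
    if n - k - 2 <= 0:
        raise ValueError("the Bayes risk is finite only if n - k - 2 > 0")
    sigma2_ml = rss / n
    # -(2/n) log f(y | beta_hat, sigma2_hat) for the normal likelihood
    neg2_loglik_per_obs = math.log(2.0 * math.pi) + math.log(sigma2_ml) + 1.0
    return neg2_loglik_per_obs + math.log((n + k) / (n - k - 2))


if __name__ == "__main__":
    print(min_bayes_risk(rss=30.0, n=20, k=3))
```

The two functions are tied together by (A.35): the minimum risk equals log(2 pi) + log(sigma^2*) + 1, where sigma^2* is the value returned by `bayes_variance_estimate`.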

Let us make a comparison of two nested alternatives $F_1$ and $F_2$ given
in (3.14). The Bayes decision rule, based on the magnitude of the
minimum attainable Bayes risk, leads us to the following decision rule which
is again described in terms of a familiar F-statistic. 12/


Theorem 4.1.  A decision rule based on the minimum attainable Bayes risk
is equivalent to:  choose $F_1$ if

(4.5)  $W < \frac{2(n-1)(n-p-q)}{(n+p)(n-p-q-2)}$;

choose $F_2$ otherwise, where $W$, defined by (3.15), is an F-statistic
conventionally employed to test the hypothesis that $\beta_2 = 0$.

We  call  the  right-hand  side  of  (4.5)  the  Bayes  critical  point, 
which  tends  to  2  asymptotically,  increases  with  q,  and  decreases  with 
p  if  n  is  moderately  large.   Limiting  ourselves  to  the  case  of  q  =  1, 
we  tabulate  the  numerical  values  of  the  square  root  of  the  Bayes  critical 
point in Table 4.1 which is comparable to Tables 3.1 and 3.2.

Table 4.1.  Bayes Critical Points and Significance Levels
for the Preliminary t-Test

  n \ p         2             3             4             5            10

    10     1.499(.178)   1.441(.200)   1.464(.203)   1.549(.196)       --
    12     1.421(.189)   1.398(.200)   1.387(.208)   1.393(.213)       --
    16     1.403(.184)   1.376(.194)   1.354(.203)   1.336(.211)   1.387(.224)
    20     1.399(.180)   1.374(.188)   1.352(.196)   1.332(.204)   1.276(.234)
    30     1.399(.173)   1.380(.179)   1.362(.185)   1.345(.191)   1.270(.220)
    50     1.403(.167)   1.390(.171)   1.378(.175)   1.366(.179)   1.312(.197)
   100     1.408(.162)   1.401(.164)   1.395(.166)   1.388(.168)   1.357(.175)
   200     1.411(.160)   1.407(.162)   1.404(.162)   1.401(.162)   1.384(.166)
  1000     1.414(.158)   1.413(.158)   1.412(.158)   1.411(.158)   1.408(.159)

See the footnote to Table 3.1.
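The entries of Table 4.1 can be recomputed from the Bayes critical point. The sketch below assumes the closed form 2(n-1)(n-p-q)/((n+p)(n-p-q-2)), which is a reconstruction from (A.37) under the assumption that W is the usual F-statistic for the q additional regressors; the function name `bayes_critical_point` is illustrative. With q = 1 the square root of this expression agrees with the tabulated values checked so far, for example 1.441 at n = 10, p = 3.

```python
import math


def bayes_critical_point(n: int, p: int, q: int = 1) -> float:
    """Right-hand side of (4.5), reconstructed from (A.37).

    n: sample size, p: regressors in the smaller model F1,
    q: additional regressors in the larger model F2.
    """
    if n - p - q - 2 <= 0:
        raise ValueError("requires n - p - q - 2 > 0")
    return 2.0 * (n - 1) * (n - p - q) / ((n + p) * (n - p - q - 2))


if __name__ == "__main__":
    # Square roots for q = 1, laid out like the rows of Table 4.1.
    for n in (10, 12, 16, 20, 30, 50, 100, 200, 1000):
        row = []
        for p in (2, 3, 4, 5, 10):
            try:
                row.append(f"{math.sqrt(bayes_critical_point(n, p)):.3f}")
            except ValueError:
                row.append("  -- ")
        print(f"{n:5d}  " + "  ".join(row))
```

The missing entries for p = 10 with n = 10 and n = 12 correspond to cases where n - p - q - 2 is not positive, so the critical point is undefined.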

It is interesting to note that the Bayes critical point varies
very little with changes in the values of n and p. Also,
it is very close to the minimax regret critical point in Sawa and
Hiromatsu [8].

5.  Bias  of  Decision  Rules 

Now we return to Section 3 and reconsider the problem from the
viewpoint of sampling theory. When we compare the two nested alternative
models given in (3.14), our decision rule should be in principle based
on the risk function given in Theorem 3.1. That is, we should choose
$F_1$ if $R(F_1(\cdot \mid \hat\theta_1)) < R(F_2(\cdot \mid \hat\theta_2))$ and vice versa.

Lemma 5.1.  If $\delta = \sigma_1^2 - \sigma_2^2 = O(n^{-1})$, then

(5.1)  $R(F_1(\cdot \mid \hat\theta_1)) - R(F_2(\cdot \mid \hat\theta_2)) = \frac{\delta}{\sigma_2^2} - \frac{q\,\omega^2}{n\,\sigma_2^2} + O(n^{-2})$.

The proof is given in the Appendix. It should be recalled that
when we derived the BIC the terms of $O(n^{-2})$ were neglected. It is,
therefore, consistent that we evaluate the difference of risk only to
order $O(n^{-1})$. The difference $\delta$ between the pseudo-variances is assumed
to be $O(n^{-1})$. This assumption may seem to be somewhat uncomfortable.
However, it may be justified by the fact that the model discrimination
procedure would be unnecessary unless the difference between the two
alternatives is as small as the reciprocal of the sample size.
Incidentally, starting from Mallows' type risk function, Sawa and Takeuchi [9]
have arrived at essentially the same result as (5.1). This reflects
the asymptotic equivalence of the two different approaches.

We can legitimately define a correct decision rule as follows:
choose the model $F_1$ if $n\delta/\omega^2 \le q$ and choose $F_2$ if $n\delta/\omega^2 > q$.
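The correct decision rule is just the sign of the leading terms in (5.1). A minimal sketch, assuming the pseudo-variances and the true variance were known (which they are not in practice); the function and argument names are illustrative only.

```python
def risk_difference(delta: float, sigma2_sq: float, omega_sq: float,
                    q: int, n: int) -> float:
    """Leading terms of R(F1) - R(F2) in (5.1), dropping O(n^-2).

    delta     = sigma_1^2 - sigma_2^2 (difference of pseudo-variances)
    sigma2_sq = sigma_2^2, omega_sq = omega^2 (true variance)
    """
    return delta / sigma2_sq - q * omega_sq / (n * sigma2_sq)


def correct_choice(delta: float, omega_sq: float, q: int, n: int) -> str:
    """Correct decision: F1 if n * delta / omega^2 <= q, otherwise F2."""
    return "F1" if n * delta / omega_sq <= q else "F2"


if __name__ == "__main__":
    for delta in (0.01, 0.05):
        diff = risk_difference(delta, sigma2_sq=1.2, omega_sq=1.0, q=2, n=100)
        print(delta, diff, correct_choice(delta, omega_sq=1.0, q=2, n=100))
```

A negative risk difference goes with the choice of the simpler model, and a positive one with the choice of the more complex model.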

Based on the preceding consideration, we introduce the notion of
unbiasedness of a decision rule: a decision rule is said to be unbiased
if the probability of choosing $F_1$ is greater than 1/2 when $n\delta/\omega^2 \le q$
and less than 1/2 when $n\delta/\omega^2 > q$. If the probability decreases
continuously with the increase of $n\delta/\omega^2$, the condition of unbiasedness
is simply described as follows: the probability of choosing $F_1$ (or $F_2$)
is 1/2 when $n\delta/\omega^2 = q$. Note that when $n\delta/\omega^2 = q$ the two alternative
models are equally desirable. If the probability of choosing $F_1$ at this
point exceeds 1/2, then the decision rule is said to be biased toward a
simpler model; if it falls below 1/2, then the decision rule is biased
toward a more complex model.

All decision rules considered so far are based on whether or not
an observed value of $W$, given by (3.15), exceeds a constant which changes
with n, p, and q. Under the assumption that $Y \sim N(\mu, \omega^2 I_n)$, $W$ is
distributed as a doubly noncentral F with (q, n-p-q) degrees of
freedom and the noncentrality parameters

(5.2)  $\delta_1 = \mu' X^{*} (X^{*\prime} X^{*})^{-1} X^{*\prime} \mu \,/\, \omega^2$,

(5.3)  $\delta_2 = \mu' [\, I - X_1 (X_1' X_1)^{-1} X_1' - X^{*} (X^{*\prime} X^{*})^{-1} X^{*\prime} \,] \mu \,/\, \omega^2$,

where $X^{*} = X_2 - X_1 (X_1' X_1)^{-1} X_1' X_2$. It would be worth noting here that a
decision is correct if we decide to choose $F_1$ when the noncentrality
parameter of the numerator is less than its degrees of freedom and vice
versa.
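The probability that such a rule picks the simpler model can be approximated by simulation. The sketch below is a Monte Carlo illustration, not the numerical method used in the paper. The critical point is an input because the BIC critical point has no closed form, and the value 2.0 in the usage line is the constant critical point attributed to Mallows' rule in the footnotes. All function names are illustrative.

```python
import math
import random


def noncentral_chi2(rng: random.Random, df: int, nonc: float) -> float:
    """One draw from a noncentral chi-square with df degrees of freedom.

    Uses the decomposition (Z + sqrt(nonc))^2 + chi2(df - 1), where the
    central part is drawn as a gamma variate with shape (df - 1)/2 and
    scale 2.
    """
    z = rng.gauss(0.0, 1.0) + math.sqrt(nonc)
    rest = rng.gammavariate((df - 1) / 2.0, 2.0) if df > 1 else 0.0
    return z * z + rest


def prob_choose_simpler(critical_point: float, q: int, m: int,
                        delta1: float, delta2: float,
                        draws: int = 20000, seed: int = 0) -> float:
    """Estimate P(W < critical_point) for a doubly noncentral F variate
    with (q, m) degrees of freedom and noncentralities (delta1, delta2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        numerator = noncentral_chi2(rng, q, delta1) / q
        denominator = noncentral_chi2(rng, m, delta2) / m
        if numerator / denominator < critical_point:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    # Indifference point: numerator noncentrality equals q.
    print(prob_choose_simpler(2.0, q=1, m=47, delta1=1.0, delta2=0.0))
```

A value above 1/2 at the indifference point indicates a rule biased toward the simpler model in the sense defined above.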

In Table 5.1 we tabulate the probability that $W$ falls below the BIC
critical point when $n\delta/\omega^2 = q$, i.e., when $F_1$ and $F_2$ are indifferent.
It can be observed from the Table that the BIC procedure is considerably
biased toward a simpler model.

Table 5.1.  Bias of the BIC Decision Rule

noncentrality    n = 10    n = 20    n = 30    n = 40    n = 50

      .0          .696      .671      .664      .661      .659
      .2          .742      .720      .714      .711      .709
      .4          .781      .762      .756      .753      .752
      .6          .814      .797      .791      .789      .788
      .8          .842      .827      .822      .820      .818
     1.0          .866      .852      .848      .846      .844

      .0          .738      .689      .675      .669      .666
      .2          .781      .738      .725      .719      .715
      .4          .817      .779      .767      .761      .758
      .6          .848      .813      .802      .796      .793
      .8          .873      .842      .832      .827      .824
     1.0          .894      .866      .857      .852      .850

Each entry in the table is the probability that a doubly noncentral F
variate, with noncentrality parameters ($\delta_1$, $\delta_2$) and (1, n - p - 1)
degrees of freedom, falls below the BIC critical point when $\delta_1 = 1$.
The noncentrality is $\delta_2/(n - p - 1)$, i.e., the normalized noncentrality
parameter of the denominator in F, where $\delta_2$ is given by (5.3).

The unbiased decision rule has been considered in more detail by
Sawa and Takeuchi [9].

Appendix


Proof  of  Lemma  3.1 


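The log likelihood function is

(A.1)  $\log f(y \mid \theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \| y - X\beta \|^2$,

where $\theta' = (\beta', \sigma^2)$ and $\| \cdot \|$ stands for a Euclidean norm.
Differentiating it with respect to $\beta$ and $\sigma^2$, we have

(A.2)  $\frac{\partial \log f(y \mid \theta)}{\partial \beta} = \frac{1}{\sigma^2} X'(y - X\beta)$,

(A.3)  $\frac{\partial \log f(y \mid \theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \| y - X\beta \|^2$.

Then

(A.4)  $E\left[ \frac{\partial \log f(Y \mid \theta)}{\partial \beta} \right] = \frac{1}{\sigma^2} X'(\mu - X\beta)$,

(A.5)  $E\left[ \frac{\partial \log f(Y \mid \theta)}{\partial \sigma^2} \right] = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} E \| Y - X\beta \|^2$

       $= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \left( E \| Y - \mu \|^2 + \| \mu - X\beta \|^2 \right)$

       $= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \left( \mathrm{tr}\, \Omega + \| \mu - X\beta \|^2 \right)$.

Equating (A.4) and (A.5) to zero and solving them yields the
pseudo-true parameter values $\beta_0$ and $\sigma_0^2$ given, respectively, by (3.3)
and (3.4).

Proof of Lemma 3.2


(A.6)  $E(\hat\beta) = (X'X)^{-1} X'\mu = \beta_0$

(A.7)  $E(\hat\sigma^2) = \frac{1}{n} \mathrm{tr}\, P_X (\mu\mu' + \omega^2 I_n) = \frac{n-k}{n} \omega^2 + \frac{1}{n} \mu' P_X \mu$

where $P_X = I - X(X'X)^{-1}X'$. Then

(A.8)  $\lim E(\hat\sigma^2) = \lim \sigma_0^2$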


Proof  of  Lemma  3.3 

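From (A.1) we have

(A.9)  $-\frac{2}{n} \log f(Y \mid \hat\theta) = \log(2\pi) + \log \hat\sigma^2 + \frac{1}{n\hat\sigma^2} \| Y - X\hat\beta \|^2$

where $Y$ is a vector random variable independent of $\hat\theta$. Taking expectation
of (A.9) and substituting

(A.10)  $E[\, \| Y - X\hat\beta \|^2 \mid \hat\theta \,] = E \| Y - X\beta_0 \|^2 - 2 E[(Y - X\beta_0)' X (\hat\beta - \beta_0)] + \| X(\hat\beta - \beta_0) \|^2$

        $= n\sigma_0^2 - 2\mu' P_X X (\hat\beta - \beta_0) + \| X(\hat\beta - \beta_0) \|^2$

        $= n\sigma_0^2 + \| X(\hat\beta - \beta_0) \|^2$

therein, we obtain (3.11).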

Proof  of  Theorem  3.1 

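The risk function is

(A.11)  $R(F(\cdot \mid \hat\theta)) = E[W(F(\cdot \mid \hat\theta))]$

        $= \log(2\pi) + \log(\sigma^2) + E\left[\log\left(\frac{\hat\sigma^2}{\sigma^2}\right)\right] + E\left(\frac{\sigma^2}{\hat\sigma^2}\right) + \frac{1}{n\sigma^2} E\left(\frac{\sigma^2}{\hat\sigma^2}\right) E \| X(\hat\beta - \beta) \|^2$

where use is made of the independence of $\hat\sigma^2$ and $\hat\beta$, and the suffix 0 of
$\sigma_0^2$ and $\beta_0$ is dropped. We have the following power series expansions:

(A.12)  $\log\left(\frac{\hat\sigma^2}{\sigma^2}\right) = \log(1 + \Delta) = \Delta - \frac{1}{2}\Delta^2 + \cdots$

(A.13)  $\frac{\sigma^2}{\hat\sigma^2} = \frac{1}{1 + \Delta} = 1 - \Delta + \Delta^2 + \cdots$

where

(A.14)  $\Delta = \frac{\hat\sigma^2 - \sigma^2}{\sigma^2}$.

Note that under the assumptions stated in the Theorem the expectations
of higher order terms in the expansions are of order $O(n^{-2})$.

(A.15)  $\Delta = \frac{1}{n\sigma^2} [\, Y' P_X Y - n\omega^2 - \mu' P_X \mu \,]$

        $= \frac{1}{n\sigma^2} [\, V' P_X V + 2\mu' P_X V \,] - \frac{\omega^2}{\sigma^2}$

where $V = Y - \mu$. Under the assumptions in the Theorem

(A.16)  $E(V' P_X V) = \omega^2 \, \mathrm{tr}\, P_X = (n - k)\omega^2$

(A.17)  $E(V' P_X V)^2 = \omega^4 [\, (\mathrm{tr}\, P_X)^2 + 2\, \mathrm{tr}\, P_X \,] = [\, (n - k)^2 + 2(n - k) \,] \omega^4$

(A.18)  $E(\mu' P_X V) = E[\, V' P_X V \, \mu' P_X V \,] = 0$

(A.19)  $E(\mu' P_X V)^2 = \omega^2 \mu' P_X \mu = n\omega^2 (\sigma^2 - \omega^2)$

Hence, rearranging the terms, we obtain

(A.20)  $E(\Delta) = -\frac{k}{n} \left( \frac{\omega^2}{\sigma^2} \right)$,

(A.21)  $E(\Delta^2) = \frac{4}{n} \left( \frac{\omega^2}{\sigma^2} \right) - \frac{2}{n} \left( \frac{\omega^2}{\sigma^2} \right)^2 + O(n^{-2})$.

Also, we have

(A.22)  $E \| X(\hat\beta - \beta) \|^2 = E \| X(X'X)^{-1}X'V \|^2 = \omega^2 \, \mathrm{tr}\, X(X'X)^{-1}X' = k\omega^2$.

Therefore,

(A.23)  $E\left[\log\left(\frac{\hat\sigma^2}{\sigma^2}\right)\right] + E\left(\frac{\sigma^2}{\hat\sigma^2}\right) = 1 + \frac{1}{2} E(\Delta^2) + O(n^{-2})$

        $= 1 + \frac{2}{n} \left( \frac{\omega^2}{\sigma^2} \right) - \frac{1}{n} \left( \frac{\omega^2}{\sigma^2} \right)^2 + O(n^{-2})$

(A.24)  $E\left(\frac{\sigma^2}{\hat\sigma^2}\right) E \| X(\hat\beta - \beta) \|^2 = k\omega^2 + O(n^{-1})$

Substituting (A.23) and (A.24) into (A.11), we finally obtain (3.12).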


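Proof of Theorem 3.2


From (A.12), (A.20) and (A.21)

(A.25)  $E(\log \hat\sigma^2) = \log \sigma^2 + E(\Delta) - \frac{1}{2} E(\Delta^2)$

        $= \log \sigma^2 - \frac{k}{n} \left( \frac{\omega^2}{\sigma^2} \right) - \frac{2}{n} \left( \frac{\omega^2}{\sigma^2} \right) + \frac{1}{n} \left( \frac{\omega^2}{\sigma^2} \right)^2 + O(n^{-2})$.

Moreover, as $\hat\omega^2 = \omega^2 + O_p(n^{-1/2})$ by assumption and $\hat\sigma^2 = \sigma^2 + O_p(n^{-1/2})$, we have

(A.26)  $E\left( \frac{\hat\omega^2}{\hat\sigma^2} \right) = \frac{\omega^2}{\sigma^2} [\, 1 + O(n^{-1}) \,]$

and

(A.27)  $E\left( \frac{\hat\omega^4}{\hat\sigma^4} \right) = \frac{\omega^4}{\sigma^4} [\, 1 + O(n^{-1}) \,]$.

Noting that

(A.28)  $-2 \log f(y \mid \hat\theta) = n \log(2\pi) + n \log \hat\sigma^2 + n$

and combining the above expectations, we obtain

(A.29)  $n E[\, \mathrm{BIC}(F(\cdot \mid \hat\theta)) \,] = n R(F(\cdot \mid \hat\theta)) + O(n^{-1})$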

Proof  of  Lemma  4.1 

If we assume a linear normal regression model $Y \sim N(X\beta, \sigma^2 I)$
with diffuse prior for $\beta$ and $\sigma^2$, the conditional posterior
distribution of $\beta$, given $\sigma^2$, is $N(\hat\beta, \sigma^2 (X'X)^{-1})$ where $\hat\beta = (X'X)^{-1}X'y$ is the
maximum likelihood estimate, and also the marginal posterior distribution
for $\sigma^2$ is the inverse gamma distribution with the density function

(A.30)  $\frac{1}{\Gamma(\nu/2)} \left( \frac{\nu s^2}{2} \right)^{\nu/2} (\sigma^2)^{-(\nu/2 + 1)} \exp\left( -\frac{\nu s^2}{2\sigma^2} \right)$

where $\nu = n - k$ and $s^2 = n\hat\sigma^2/(n - k)$. The proof is given by Zellner
[10]. The conditional expectation of $\| X(\beta - \tilde\beta) \|^2$ with respect to the
posterior distribution is

(A.31)  $E_{\beta \mid y, \sigma^2} \| X(\beta - \tilde\beta) \|^2 = E_{\beta \mid y, \sigma^2} \| X(\beta - \hat\beta) \|^2 + \| X(\hat\beta - \tilde\beta) \|^2$

        $\ge E_{\beta \mid y, \sigma^2} \| X(\beta - \hat\beta) \|^2$

        $= k\sigma^2$

where the lower bound is attainable when $\tilde\beta = \hat\beta$; i.e., the Bayes estimate
$\beta^*$ of $\beta$ is nothing but the ML estimate. A straightforward integration
yields

(A.32)  $E_{\sigma^2 \mid y}(\sigma^2) = \frac{\nu}{\nu - 2} s^2 = \frac{n - k}{n - k - 2} s^2$

as long as $\nu > 2$. Hence

(A.33)  $E_{\theta \mid y} [\, W(F(\cdot \mid \tilde\theta)) \,] \ge \log(2\pi) + \log \tilde\sigma^2 + \frac{n - k}{n - k - 2} \left( 1 + \frac{k}{n} \right) \frac{s^2}{\tilde\sigma^2}$.

The Bayes estimate of $\sigma^2$ is the $\tilde\sigma^2$ that minimizes the right-hand side of the
above inequality; i.e.

(A.34)  $\sigma^{2*} = \frac{n + k}{n - k - 2} \hat\sigma^2$

where $\hat\sigma^2$ is the ML estimate of $\sigma^2$. Substituting this into the right-hand
side of (A.33), the minimum attainable Bayes risk is evaluated as
follows:

(A.35)  $B(\beta^*, \sigma^{2*} \mid F) = \log 2\pi + \log \sigma^{2*} + 1$

        $= \log 2\pi + \log \hat\sigma^2 + 1 + \log\left( \frac{n + k}{n - k - 2} \right)$

        $= -\frac{2}{n} \log f(y \mid \hat\theta) + \log\left( \frac{n + k}{n - k - 2} \right)$.

Proof  of  Theorem  4.1 

Let $B_1$ and $B_2$ be the minimum attainable Bayes risks, respectively, for
$F_1$ and $F_2$ with diffuse prior for parameters. The difference between
$B_1$ and $B_2$ is

(A.36)  $B_1 - B_2 = \log\left( \frac{\hat\sigma_1^2}{\hat\sigma_2^2} \right) + \log\left[ \frac{(n + p)(n - p - q - 2)}{(n + p + q)(n - p - 2)} \right]$.

If this is negative, we should choose $F_1$, and vice versa. By the
monotonicity of the logarithm transformation, $B_1 - B_2 < 0$ is
equivalent to

(A.37)  $\frac{\hat\sigma_1^2}{\hat\sigma_2^2} < \frac{(n + p + q)(n - p - 2)}{(n + p)(n - p - q - 2)}$,

which is again equivalent to (4.5).
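
The last step can be spelled out as follows. This is a sketch that assumes $W$ in (3.15) is the usual F-statistic, $W = \frac{n-p-q}{q}\left(\hat\sigma_1^2/\hat\sigma_2^2 - 1\right)$, and that $n - p - q - 2 > 0$; the definition (3.15) itself is not reproduced here, so this form is an assumption.

```latex
\begin{align*}
\frac{\hat\sigma_1^2}{\hat\sigma_2^2} - 1
  &< \frac{(n+p+q)(n-p-2) - (n+p)(n-p-q-2)}{(n+p)(n-p-q-2)}
   = \frac{2q(n-1)}{(n+p)(n-p-q-2)}, \\[4pt]
W = \frac{n-p-q}{q}\left(\frac{\hat\sigma_1^2}{\hat\sigma_2^2} - 1\right)
  &< \frac{2(n-1)(n-p-q)}{(n+p)(n-p-q-2)} .
\end{align*}
```

The numerator simplifies because $(n+p+q)(n-p-2) - (n+p)(n-p-q-2) = q(n-p-2) + q(n+p) = 2q(n-1)$.
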
Proof  of  Lemma  5.1 


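(A.38)  $R(F_1(\cdot \mid \hat\theta_1)) - R(F_2(\cdot \mid \hat\theta_2)) = \log\left( \frac{\sigma_1^2}{\sigma_2^2} \right) + \frac{p + 2}{n} \left( \frac{\omega^2}{\sigma_1^2} - \frac{\omega^2}{\sigma_2^2} \right)$

        $- \frac{q}{n} \frac{\omega^2}{\sigma_2^2} + \frac{1}{n} \left( \frac{1}{\sigma_2^4} - \frac{1}{\sigma_1^4} \right) \omega^4 + O(n^{-2})$

If we assume that

(A.39)  $\delta = \sigma_1^2 - \sigma_2^2 = O(n^{-1})$,

we have an expansion

(A.40)  $\log\left( \frac{\sigma_1^2}{\sigma_2^2} \right) = \log\left( 1 + \frac{\delta}{\sigma_2^2} \right) = \frac{\delta}{\sigma_2^2} + O(n^{-2})$.

Also, it follows that the terms on the right-hand side of (A.38) that
involve $1/\sigma_1^2 - 1/\sigma_2^2$ and $1/\sigma_2^4 - 1/\sigma_1^4$ are of order $O(n^{-2})$. Hence, if we
neglect the terms of order $O(n^{-2})$, we can assert that
$R(F_1(\cdot \mid \hat\theta_1)) < R(F_2(\cdot \mid \hat\theta_2))$ if and only if

(A.41)  $\frac{n\delta}{\omega^2} < q$

and vice versa.

FOOTNOTES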

1.  Regarding the importance of the model identification in econometrics,
the readers should refer to excellent comprehensive survey papers
by Gaver and Geisel [3] and Ramsey [6]. Particularly in Section
2 of Ramsey [6], a very illuminating as well as profound discussion
is given about a concept of models.

2.  In what follows, for simplicity of exposition $F(\cdot \mid \theta)$ will be simply
called a model, instead of a family of models, except for cases
when a sharp distinction needs to be drawn between a family of models
and its particular element.

3.  It  is  fair  to  say  that  the  assumption  here  is  nearly  equivalent  to 
assuming  the  normal  distribution. 

4.  The precise meaning of $O_p(n^{-\alpha})$ is as follows:  Given $\varepsilon > 0$, if
there exists a positive number $\lambda_\varepsilon$ such that

$\Pr\{ |X_n| < \lambda_\varepsilon n^{-\alpha} \} > 1 - \varepsilon$,

then we say that $X_n = O_p(n^{-\alpha})$. Note that

(i)  $O_p(n^{-\alpha}) \, O_p(n^{-\gamma}) = O_p(n^{-\alpha - \gamma})$

(ii)  $O_p(n^{-\alpha}) + O_p(n^{-\alpha}) = O_p(n^{-\alpha})$.

Also, if $E|X_n|^k < \infty$, then $E|X_n|^k = O(n^{-k\alpha})$.

5.  Note  that  the  number  of  unknown  parameters  is  k  +  1,  i.e.,  k  regression 
coefficients  and  variance. 

6.  It should be here emphasized that the difference between the AIC
and BIC decision rules stems from the following:  the AIC for $F_1$
is evaluated assuming that $\omega^2 - \sigma^2 = o(1)$, whereas the BIC for
$F_1$ is evaluated without assuming that $\omega^2 - \sigma^2 = o(1)$. See the
last paragraph of Section 3.

7.  It is impossible to explicitly write down the BIC critical point
as a function of n, p and q. However, for each combination of n,
p and q, we can evaluate the BIC critical point numerically. Note
that the inequality BIC($F_1$) < BIC($F_2$) is equivalent to the inequality
that the F statistic is less than a critical point determined by n,
p and q.

8.  It should be here noted that the decision based on the adjusted
multiple correlation coefficient is also equivalent to a decision
based on the F-statistic with a constant critical point equalling
one. Also, Mallows' $C_p$ statistic leads us to a decision based on
the F-statistic with a critical point equalling two, irrespective
of n, p and q.


9.   The difference between the AIC and the BIC is more substantial for
a  larger  value  of  q.   In  his  personal  correspondence  Dr.  Akaike 
pointed  out  that  the  two  criteria  give  almost  identical  critical 
points  for  cases  when  p/n  <  0.1.   An  implication  may  be  that  the 
simplifying  assumption  made  by  Akaike  is  virtually  harmless  if 
the  sample  size  is  large  enough  to  satisfy  the  above  condition. 

10.  A decision rule based on $\bar{R}$, the multiple correlation coefficient
adjusted for the degrees of freedom, is equivalent to a decision
based on the F-statistic with critical point unity regardless of the
degrees of freedom. (The proof is quite straightforward.) This
decision rule is perhaps most often used in practical regression
analysis. The implied significance level is a little greater
than 30%. Presumably, this is the most prodigal decision rule.

11.  Mallows' $C_p$ statistic is $C_p = \mathrm{RSS} + 2p\,\hat\omega^2$, where RSS is the
residual sum of squares, $p$ is the number of explanatory variables,
and $\hat\omega^2$ is an estimate of the common variance of the $Y_i$'s. It is
straightforward to show that a decision based on $C_p$ is equivalent
to a decision based on the F-statistic with a constant critical
point equalling two. Therefore, the AIC and BIC decision rules
are asymptotically equivalent to Mallows' decision rule.

12.  In his personal correspondence Dr. Akaike noticed the following:
since

$\log\left( \frac{n + k}{n - k - 2} \right) \approx \frac{2(k + 1)}{n}$

if $n \gg k$, a decision rule based on Bayes risk is almost equivalent
to the MAIC decision rule. This may provide another justification
for the MAIC procedure. It is fair to note that the decision rule
derived in this section is considerably different from the orthodox
Bayesian approach.
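
The approximation in footnote 12 is easy to check numerically. A minimal sketch; the function names are illustrative and the choice k = 3 is arbitrary.

```python
import math


def bayes_penalty(n: int, k: int) -> float:
    """Penalty term log((n + k) / (n - k - 2)) from (4.4)."""
    return math.log((n + k) / (n - k - 2))


def aic_like_penalty(n: int, k: int) -> float:
    """The approximating term 2 (k + 1) / n from footnote 12."""
    return 2.0 * (k + 1) / n


if __name__ == "__main__":
    for n in (20, 50, 200, 1000):
        print(n, round(bayes_penalty(n, 3), 4), round(aic_like_penalty(n, 3), 4))
```

The two terms approach each other as n grows relative to k, and differ noticeably when n is small.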